http://poloclub.gatech.edu/cse6242
CSE6242 / CX4242: Data & Visual Analytics
Ensemble Methods
(Model Combination)
Duen Horng (Polo) Chau
Assistant Professor
Associate Director, MS Analytics
Georgia Tech
Partly based on materials by
Professors Guy Lebanon, Jeffrey Heer, John Stasko, Christos Faloutsos, Parishit Ram (GT PhD alum; SkyTree), Alex Gray
Numerous Possible Classifiers!
Classifier               Training time   Cross validation   Testing time   Accuracy
kNN classifier           None            Can be slow        Slow           ??
Decision trees           Slow            Very slow          Very fast      ??
Naive Bayes classifier   Fast            None               Fast           ??
…
Which Classifier/Model to Choose?
Possible strategies:
• Go from simplest model to more complex model until you obtain desired accuracy
• Discover a new model if the existing ones do not work for you
• Combine all (simple) models
Common Strategy: Bagging
(Bootstrap Aggregating)
Consider the data set S = {(x_i, y_i)}, i = 1, …, n
• Pick a sample S* of size n, with replacement
• Train on S* to get a classifier f*
• Repeat the above steps B times to get classifiers f_1, f_2, …, f_B
• Final classifier: f(x) = majority{f_b(x)}, b = 1, …, B (see the sketch below)
http://statistics.about.com/od/Applications/a/What-Is-Bootstrapping.htm
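Below is a minimal sketch of this procedure in Python (not from the original slides). It assumes scikit-learn's DecisionTreeClassifier as the base learner f* and integer class labels; the names bagging_fit and bagging_predict are illustrative.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, B=50, seed=0):
    # Train B classifiers f_1, ..., f_B, each on a bootstrap sample S* of size n.
    rng = np.random.default_rng(seed)
    n = len(X)
    models = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)              # sample with replacement
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    # Final classifier f(x): majority vote over f_1(x), ..., f_B(x).
    votes = np.stack([m.predict(X) for m in models])  # shape (B, n_samples)
    # Assumes integer class labels 0..K-1 so np.bincount can tally the votes.
    return np.array([np.bincount(votes[:, j]).argmax() for j in range(votes.shape[1])])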
Common Strategy: Bagging
Why would bagging work?
• Combining multiple classifiers reduces the variance of the final classifier (made precise below)
When would this be useful?
• We have a classifier with high variance
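To make the variance-reduction claim concrete (a step not spelled out on the slide, following ESL Ch. 15): if the B classifiers' predictions at a point x are identically distributed with variance \sigma^2, then

\mathrm{Var}\!\left(\frac{1}{B}\sum_{b=1}^{B} f_b(x)\right) = \frac{\sigma^2}{B} \quad \text{(independent case)}, \qquad
\mathrm{Var}\!\left(\frac{1}{B}\sum_{b=1}^{B} f_b(x)\right) = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2 \quad \text{(pairwise correlation } \rho\text{)}.

The second term shrinks as B grows, which is why averaging helps most when the base classifier has high variance; the remaining \rho\sigma^2 term is what random forests attack by decorrelating the trees.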
Bagging decision trees
Consider the data set S
• Pick a sample S* of size n, with replacement
• Grow a decision tree T_b greedily
• Repeat B times to get T_1, …, T_B
• The final classifier is the majority vote over T_1, …, T_B (a scikit-learn sketch follows below)
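As a hedged sketch (the slides do not prescribe a library), the same recipe with scikit-learn's BaggingClassifier; note the base-tree argument is named estimator in recent scikit-learn releases (base_estimator in older ones).

from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

bag = BaggingClassifier(
    estimator=DecisionTreeClassifier(),  # each T_b is a greedily grown tree
    n_estimators=50,                     # B
    bootstrap=True,                      # each tree sees a bootstrap sample S*
    random_state=0,
).fit(X_train, y_train)

print("Test accuracy:", bag.score(X_test, y_test))  # predictions are the majority vote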
Random Forests
Almost identical to bagging decision trees, except we introduce some randomness:
• Randomly pick m of the d attributes available
• Grow the tree using only those m attributes

Bagged random decision trees = random forests (see the sketch below)
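A minimal random-forest sketch with scikit-learn (an assumption, not part of the slides); max_features plays the role of m and n_estimators the role of B.

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(
    n_estimators=100,      # B bagged random trees
    max_features="sqrt",   # m ≈ sqrt(d) attributes considered at each split
    random_state=0,
).fit(X_train, y_train)

print("Test accuracy:", rf.score(X_test, y_test))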
Points about random forests
Algorithm parameters
• Usual values for m: around √d (ESL suggests √d for classification, d/3 for regression)
• Usual value for B: keep increasing B until the training error stabilizes
Explicit CV not necessary
• Unbiased test error can be estimated using out-of-bag data points (OOB error estimate), as illustrated below
• You can still do CV explicitly, but that's not necessary, since research shows the OOB estimate is as accurate
https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#ooberr http://stackoverflow.com/questions/18541923/what-is-out-of-bag-error-in-random-forests
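A sketch of the OOB estimate in scikit-learn (an assumption; the Breiman page above describes OOB in general terms). Each tree is scored only on the ~37% of points left out of its bootstrap sample, so the estimate comes free with training; with warm_start the forest grows incrementally, so you can watch the OOB error stabilize as B increases instead of running explicit cross-validation.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

rf = RandomForestClassifier(oob_score=True, warm_start=True, random_state=0)
for B in (25, 50, 100, 200):
    rf.set_params(n_estimators=B)
    rf.fit(X, y)  # with warm_start, only the newly added trees are trained
    print(f"B={B:3d}  OOB error = {1 - rf.oob_score_:.4f}")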
Final words
Advantages:
• Efficient and simple training
• Allows you to work with simple classifiers
• Random forests are generally useful and accurate in practice (one of the best classifiers)
• Embarrassingly parallelizable

Caveats:
• Needs low-bias classifiers
• Can make a not-good-enough classifier worse
Final words
Reading material
• Bagging: ESL Chapter 8.7
• Random forests: ESL Chapter 15
http://www-stat.stanford.edu/~tibs/ElemStatLearn/printings/ESLII_print10.pdf